# Cross-modal Alignment
Vit So400m Patch16 Siglip 256.webli I18n
Apache-2.0
A vision Transformer model based on SigLIP, focused on image feature extraction and using the original attention pooling mechanism.
Image Classification
Transformers

timm
15
0
Vit Large Patch14 Clip 224.datacompxl
Apache-2.0
A vision Transformer model based on the CLIP architecture, designed for image feature extraction and released by LAION.
Image Classification
Transformers

timm
14
0
Mblip Bloomz 7b
MIT
mBLIP is a multilingual vision-language model based on the BLIP-2 architecture, supporting image caption generation and visual question answering tasks in 96 languages.
Image-to-Text
Transformers, Supports Multiple Languages

Gregor
21
1
Mblip Mt0 Xl
MIT
mBLIP is a multilingual vision-language model based on the BLIP-2 architecture, supporting image caption generation and visual question answering tasks in 96 languages.
Image-to-Text
Transformers, Supports Multiple Languages

Gregor
374
14